15.6 Providing Clues to Discover Underlying Causes

Question: what makes one representation better than another?
One answer: a representation that disentangles the underlying causal factors of variation that generated the data, especially those factors that are relevant to our applications.

Most strategies of representation learning are based on introducing clues that help the learner find these underlying factors of variation and separate them from the others.

  • Supervised learning provides a very strong clue: a label y that usually specifies the value of at least one of the factors of variation directly.
  • To make use of abundant unlabeled data, representation learning relies on other, less direct hints. These hints take the form of implicit prior beliefs we impose in order to guide the learning.

Generic regularization strategies (to achieve better generalization):

  • Smoothness: The assumption that \(f(x+\epsilon d) \approx f(x)\) for a unit vector d and small \(\epsilon\). This allows learners to generalize from training examples to nearby points in the input space, but it is insufficient on its own to overcome the curse of dimensionality (see the numerical sketch after this list).
  • Linearity: Many learning algorithms assume that the relationships between some variables are linear.
  • Multiple explanatory factors: Assume that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors.
  • Causal factors: Treats the factors of variation described by the learned representation h as the causes of the observed data x, and not vice versa.
  • Depth or a hierarchical organization of explanatory factors: high-level, abstract concepts can be defined in terms of simple concepts, forming a hierarchy.
  • Shared factors across tasks: Many tasks that share the same input x can be explained with a single shared representation h, each task depending on its own subset of the factors.
  • Natural clustering: Each connected manifold in the input space may be assigned to a single class.
  • Temporal and spatial coherence: Slow Feature Analysis assumes that the most important explanatory factors change slowly over time (see the slowness-penalty sketch after this list).
  • Sparsity: Most features should presumably be irrelevant for describing most inputs, i.e. only a small fraction of the features are active for any given input (see the L1-penalty sketch after this list).
  • Simplicity of factor dependencies: In good high-level representations, the factors are related to each other through simple dependencies. This is what we assume when plugging a linear predictor or a factorized prior on top of a learned representation (see the linear-readout sketch after this list).
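
The sketches below illustrate some of these priors concretely. First, the smoothness assumption: a minimal numerical check that a smooth model maps nearby inputs to nearby outputs. The model f here is a made-up stand-in (a random linear map plus tanh), not anything from the text:

```python
import numpy as np

# Hypothetical smooth model: a random linear map followed by tanh.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))

def f(x):
    return np.tanh(W @ x)

x = rng.normal(size=5)
d = rng.normal(size=5)
d /= np.linalg.norm(d)  # unit direction, as in f(x + eps*d) ~ f(x)

for eps in (1e-1, 1e-2, 1e-3):
    gap = np.linalg.norm(f(x + eps * d) - f(x))
    print(f"eps={eps:g}  ||f(x+eps*d) - f(x)|| = {gap:.6f}")
# The gap shrinks roughly linearly with eps: nearby inputs get
# nearby outputs, which is all the smoothness prior buys us.
```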
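Next, a minimal sketch of the slowness penalty behind Slow Feature Analysis. The helper name and toy signals are assumptions for illustration; the penalty itself is the usual mean squared temporal difference of the features:

```python
import numpy as np

def slowness_penalty(features):
    """SFA-style regularizer: mean squared change of the features
    between consecutive time steps. `features` has shape (T, d)."""
    diffs = features[1:] - features[:-1]
    return np.mean(np.sum(diffs ** 2, axis=1))

# Toy sequences: a slowly drifting signal incurs a small penalty,
# a rapidly fluctuating one a large penalty.
t = np.linspace(0, 1, 100)[:, None]
slow = np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
fast = np.hstack([np.sin(40 * np.pi * t), np.cos(40 * np.pi * t)])
print(slowness_penalty(slow))  # small: factors change slowly
print(slowness_penalty(fast))  # much larger
```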
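A minimal sketch of a sparsity penalty: an L1 term on the representation h, which drives most feature activations toward zero. The helper and the toy vectors are assumptions for illustration:

```python
import numpy as np

def l1_penalty(h, lam=0.1):
    # Sum of absolute activations; minimizing it pushes most
    # entries of the representation h to (near) zero.
    return lam * np.abs(h).sum()

rng = np.random.default_rng(0)
dense = rng.normal(size=100)
sparse = dense * (rng.random(100) < 0.05)  # ~5% of features active
print(l1_penalty(dense))   # large: every feature "describes" the input
print(l1_penalty(sparse))  # small: few relevant features per input
```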
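Finally, a sketch of simple factor dependencies in action: if the target depends linearly on the learned factors h, a plain least-squares readout suffices. The synthetic features H stand in for an encoder's output (an assumption; no encoder is trained here):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 10))  # stand-in for learned features h
w_true = rng.normal(size=10)
y = H @ w_true + 0.01 * rng.normal(size=200)  # target is linear in h

# If the factors in h relate to the target through a simple (linear)
# dependency, an ordinary least-squares readout is enough.
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.allclose(w_hat, w_true, atol=0.01))  # True: readout recovers w
```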